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Abstract 

In this paper we consider the problem of RNA folding with pseudoknots. We use a graphical 
representation in which the secondary structures are described by planar diagrams. Pseudoknots 
are identified as non-planar diagrams. We analyze the non-planar topologies of RNA structures 
and propose a classification of RNA pseudoknots according to the minimal genus of the surface on 
which the RNA structure can be embedded. This classification provides a simple and natural way to 
tackle the problem of RNA folding prediction in presence of pseudoknots. Based on that approach, 
we describe a Monte Carlo algorithm for the prediction of pseudoknots in an RNA molecule. 
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1 Introduction 



In recent years the quest for an algorithm which can predict the spatial structure of an RNA molecule 
given its chemical sequence has received considerable attention from molecular biologists 1 . In fact 
the three-dimensional structure of an RNA molecule is intimately connected to its specific biological 
function in the cell (e.g. for protein synthesis and transport, catalysis, chromosome replication and 
regulation) It is determined by the sequence of nucleotides along the sugar-phosphate backbone of 
the RNA. The chemical formula or sequence of covalently linked nucleotides along the molecule from 
the 5' to the 3' end is called the primary structure. The four basic types of nucleotides are adenine 
(A), cytosine (C), guanine (G) and uracil (U), but it is known that modified bases may appear [3]. 

At high enough temperatures, or under high-denaturant conditions RNA molecules have the three- 
dimensional structure of a free single-stranded swollen polymer. At room temperature, different 
nucleotides can pair by means of saturating hydrogen bonds. The standard Watson-Crick pairs are 
A»U and C»G with two and three hydrogen bonds respectively, whereas G»U is a wobble pair with 
two hydrogen bonds. Comparative methods showed that "non-canonical" pairings are also possible 
[H, as well as higher-order interactions such as triplets, or quartets. In this paper we will consider 
only canonical base-pair interactions. Adjacent base pairs can stack, providing and additional binding 
energy which is actually the origin of the formation of stable A-form helices, one of the main structural 
characteristics of folded RNAs. Helices may embed unpaired sections of RNA, in the form of hairpins, 
loops and bulges. It is all these pairings, stackings of bases and structural motifs which bring the RNA 
into its folded three-dimensional configuration. One of the main open problems of molecular biology 
is the prediction of the actual spatial molecular structure of RNA (i.e. its shape) given its primary 
structure. 

As we shall see in Section [21 it is possible to define secondary structures of RNA as structures 
in which the pairings between canonical base pairs do not cross in a certain representation (planar 
graphs). One can also define the tertiary structure of RNA which is the actual three-dimensional 
arrangement of the base sequence. This classification corresponds to the fact that the secondary 
structure of RNA carries the main contribution to the free energy of a fully folded RNA configuration, 
including also some of the sterical constraints. For that reason one can attempt to describe the folding 
process hierarchically [21 03 El d- However, since the secondary structure describes just the topology of 
binary contacts of the bases, most of the information about distances in real three-dimensional space 
is lost. The importance of the secondary structure relies in the fact that it may provide the "skeleton" 
of the final tertiary structure. 

Over the past twenty years several algorithms have been proposed for the prediction of RNA 
folding. They are based on: deterministic or stochastic minimization of a free energy function 0121) 
phylogenetic comparison ^3 El E] , kinetic folding El El El , maximal weighted matching 
method ^El> and several others (for a survey see JH1)- It is fair to say that despite the large number 
of tools available for the prediction of RNA structures, no reliable algorithms exist for the prediction 
of the full tertiary structure of RNA. Most of the algorithms listed above deal with the prediction 
of just the RNA secondary structure. To describe the full folding it is important to introduce the 
concept of RNA pseudoknot [20]. One says that two base pairs form a pseudoknot when the parts 
of the RNA sequence spanned by those two base pairs are neither disjoint, nor have one contained 
in the other. Thus RNA secondary structures without pseudoknots can be represented by planar 
diagrams, whereas RNA with pseudoknots appear when two base pairs can "cross", leading to non- 
planar diagrams (a more precise definition is given in the next Section). Pseudoknots play an important 
role in natural RNAs [2^, for structural, regulatory and catalytic functions. Pseudoknots are excluded 
in the definition of RNA secondary structure and many authors consider them as part of the tertiary 
structure. This restriction is due to the fact that RNA secondary structures without pseudoknots 
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can be predicted easily. One should also note that pseudoknots very often involve base-pairing from 
distant parts of the RNA, and are thus quite sensitive to the ionic strength of the solution. It has been 
shown that the number of pseudoknots depends on the concentration of Mg ++ ion, and can be strongly 
suppressed by decreasing the ionic strength (thus enhancing electrostatic repulsion) j221 1231 EH • The 
most popular and successful technique for predicting secondary structures is dynamic programming 
[HI [2H1 CHI 123 I2E1 I2H E01 EZU > f°r which the memory and CPU requirements scale with the sequence 
length L as 0(L 2 ) and 0(L 3 ), respectively. 

Recently, new deterministic algorithms that deal with pseudoknots have been formulated [H21 E31 
EU EH1 EH1 E3 EH EH HH- In this case the memory and CPU requirements generally scale as 0(L ) 
and 0(L 6 ) respectively (0(L 4 ) and 0(L 5 ) in |35j . or 0(L 4 ) and 0(L 3 ) for a restricted model in 
|34j ) . which can be a very demanding computational effort even for short RNA sequences (L ~ 100). 
Moreover, the main limitation of these algorithms is the lack of precise experimental informations 
about the contribution of pseudoknots to the RNA free energy, which is often excluded a priori in 
the data analysis (as also pointed out in |16 | I41 [ l4*2j). The increase of computational complexity does 
not come as a surprise. In fact the RNA-folding problem with pseudoknots has been proven to be 
NP-complete for some classes of pseudoknots [34=1 135) . For that reason, stochastic algorithms might 
be a better choice to predict secondary structures with pseudoknots in a reasonable time and for long 
enough sequences. 

In [HI I43| 1441 145] stochastic Monte Carlo algorithms for the prediction of RNA pseudoknots have 
been proposed. In these stochastic approaches, the very irregular structure of the energy landscape 
(glassy-like) is the main obstacle: configurations with small differences in energy may be separated by 
high energy barriers, and the system may very easily get trapped in metastable states. Among the 
stochastic methods, the direct simulation of the RNA-folding dynamics (including pseudoknots) with 
kinetic folding algorithms |161 117] is most successful. This technique allows to describe the succession 
of secondary structures with pseudoknots during the folding process. The approach we follow in this 
paper is close in spirit to that one, with a stronger emphasis on the topological character of the RNA 
pseudoknots. It is based on a correspondence (first noticed by E. Rivas and S.R. Eddy in [32]) between 
a graphical representation of RNA secondary structures with pseudoknots and Feynman diagrams. 
In |32] the authors consider only a particular class of pseudoknots. Along the same direction, the 
authors of [HE] made the correspondence between RNA folding and Feynman diagrams more explicit by 
formulating a matrix field theory model whose Feynman diagrams give exactly all the RNA secondary 
structures with pseudoknots. The remarkable facts of this new approach is that it provides an analytic 
tool for the prediction of pseudoknots, and all the diagrams appear to be naturally organized in a 
series of terms, called the topological expansion, where the first term corresponds to planar secondary 
structures without pseudoknots, and higher-order terms correspond to structures with pseudoknots. 

In this paper we explore in more detail this topological expansion and its potential predictive power. 
We also propose a numerical stochastic algorithm for dealing with this expansion in a systematic way, 
which in principle allows the prediction of all kinds of RNA pseudoknots. The paper is organized as 
follows. In Section |2] we review some well-known graphical representation of RNA structures, with 
special emphasis on the so-called "disk diagram" representation. In such a representation one can 
uniquely associate to each RNA secondary structure with (or without) pseudoknots, a circle diagram 
which is planar (or not planar, respectively). In SectionEJwe show how one can characterize the "degree 
of non-planarity" of a given disk diagram. In fact, one can always associate an integer number to each 
RNA disk diagram, called the genus, and we will describe its topological meaning and information 
content. We thus propose to classify RNA pseudoknots according to their genus. Following this idea, 
in Section 0] we generalize the standard thermodynamic model for the description of RNA structures 
to the inclusion of pseudoknots. The generalized model we propose is very natural, in the same spirit 
when going from the Canonical Ensemble to the Grand Canonical Ensemble in statistical mechanics. 
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Figure 1: An RNA configuration without pseudoknots (left column) and RNA configuration with 
a simple "H" pseudoknot (right column). From the top to the bottom: the RNA configuration, arc 
representation and bracket representation. Note that the arc diagram for pseudoknotted RNA has 
crossing arcs and the bracket representation requires two kinds of parenthesis. 



Our model can control the topological fluctuations i.e. the formation of pseudoknots in the RNA 
molecule, and we will describe the general features of its phase diagram. In Section [5] we describe 
a Monte Carlo algorithm for the actual calculation of thermodynamical quantities in our generalized 
model. In particular we will list in details the Monte Carlo moves, the free-energy updating algorithm 
and the simulated annealing method we propose for dealing with the problem of high energy barriers. 
Section El contains the concluding remarks, and the Appendix is devoted to the explicit description of 
a part of the Monte Carlo algorithm of Section [5J 



2 Representation of RNA structures 

Any RNA sequence can be represented as the list of nucleotides e(A,C,G,U), i = 1, . . . , L, where 
is the i-th nucleotide along the oriented sugar backbone (from 5' to 3'). The ordered list {n, r2, . . . , r^} 
is called the primary structure of the RNA. 

The RNA secondary structure requires a more graphical representation. Actually there are several 
equivalent ways to represent an RNA secondary structure with a given primary structure. The most 
commonly used representation is perhaps the bracket notation, where two paired bases, rj and rj 
(i < j), are represented by parenthesis "(" and ")", and unpaired bases are represented by a dot '.' 
or a colon ':' (see Figure Pseudoknots can be described in a similar fashion, but one needs to 
introduce several different kinds of brackets (like square brackets '[', ']' or braces '{', '}', as for example 
in the database |46l I47| ). and this is not a very efficient representation for complicated structures. 

Among several other representations (e.g. mountain diagrams @S!> tree diagrams |49j . graphs [50J, 
a very general and widely used representation is the so-called dot plot diagram. It is an array where a 
dot is placed in the row i and column j if the bases r j and rj are actually paired (see Figure EJ) . This 
plot is the graphical representation of the L x L contact matrix C with elements 

| 1 if i and j are paired , ^ 
lJ 1 otherwise. 

In mathematical terms, the contact matrix C is the matrix of the permutation involution associated 
to the given set of pairings. In fact, one can always interpret the base pairing i — j as a transposition 
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5' AGUGGCCCUUCAGCACAGGCUGUU3' 




Figure 2: Representation of an RNA secondary structure with an "H" pseudoknot (left), and the 
corresponding dot plot diagram (right). 



of the elements {i, j} and therefore one can uniquely associate a permutation a to any structure by: 

/ -\ _ f j if i and j are paired , , , 

1 j otherwise. 

For example, if the primary structure is {5' - CUUCAUCAGGAAAUGAC - 3'} and the pseudo- 
knotted secondary structure is: •(((•[[[)))••]]]•) one can associate to it the permutation: 

(((•[[[)))■•]]] • 
a = | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | , (2.3) 

1 11 10 9 5 16 15 14 4 3 2 12 13 8 7 6 17 

which is also an involution since a 2 is the identity permutation. The matrix representation of a is the 
matrix D with D^ a u\ = 1 and otherwise. Obviously D = C +2, where X is the Lx L identity matrix. 
This notation is very useful for numerical implementations of the algorithm we propose in Section 
03 The advantage of the dot plot diagram is that it allows the comparison between different RNA 
secondary structures, just by superimposition as it is necessary for comparative analysis. Moreover it 
can be used for representing RNA structure with any kind of pseudoknots. 

A representation which is completely equivalent to the dot plot diagram is the disk diagram (also 
called circle plot or circular plot). In this case the RNA sequence is represented as an oriented 
circle (from 5' to 3') by virtually linking the first nucleotide to the last one. Each base pairing is 
represented as an arc inside the circle, connecting the two paired bases. Figure 01 shows a typical 
disk diagram. In this representation secondary structures without pseudoknots are purely planar 
diagrams, i.e. diagrams that can be drawn without crossing arcs, whereas pseudoknots correspond to 
structures which are not planar. This fact has been already observed by E.Rivas and S.R.Eddy in 
|32j . where they consider diagrams with arcs inside and outside the disk 1 . Crossing arcs are allowed 
but only inside or only outside but not both at the same time (so-called "overlapping pseudoknots" , 
see diagrams a) and b) of Figure QJ. As it was shown in |39l I40j . several general classes of pseudoknots 

1 More precisely, they represent the RNA sequence as an oriented straight line, and the pairings as arcs above and 
below that line. This is of course equivalent to the disk representation. 
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Figure 3: Typical disk (circle) diagram representation of an RNA secondary structure without pseu- 
doknots. The circle is anticlockwise oriented from 5' to 3'. Note that there are no crossing arcs. 

cannot be described in such a simple way (such as the diagram on the right of Figure 0J. It is then 
more convenient to draw the arcs always inside the disk (or outside, but not both) and to consider all 
the corresponding diagrams as non planar. It is precisely following this approach that the authors of 
|38j found an algorithm for computing pseudoknots with matrix field theory in a completely general 
fashion. In this paper we pursue the same analysis by considering the diagrams themselves and not 
the associated matrix field theory model. 




a) b) c) 

Figure 4: Three kind of disk diagrams for RNA secondary structures with pseudoknots. The authors of 
[3*2] consider cases of the form a) ("overlapping pseudoknots"), b) (pseudoknot present in Escherichia 
coli a mRNA j5J) but not c) (parallel /3-sheet protein interaction). The technique in jHH] can deal 
with all the three cases. 



3 The topological character of RNA pseudoknots 

There is a very natural way for classifying the "degree of non-planarity" of a given disk diagram, 
which we review here briefly. It is based on a topological analysis introduced long ago by Euler. We 
emphasize that this characterization is a well-known classical result of algebraic topology and it has 
been already introduced in j3H] for RNA secondary structures. We repeat it here in more detail for 
the convenience of the reader. 

As we have shown, in any disk diagram the RNA sequence is represented by an oriented circle. 
When the circle is drawn on a sphere, its orientation allows to distinguish an "inside" and an "outside" 
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a) b) 




c) 

Figure 5: Example of planar disk diagram. In a) the disk diagram of a double hairpin (like the one in 
Figure ^) is on a sphere. In b) the arcs are drawn all outside the circle. In c) the sphere is partitioned 
in 6 patches (5 faces and one "hole", i.e. the RNA circle). In d) is the representation of c) in double 
line notation (black thick arcs). Here j^F = 5, j^V = 4, j^E = 8, and therefore x = 1> i- e - 5 = 0. 

of the circle. One says that the circle is a "boundary" or "puncture" (as it can be drawn smaller and 
smaller, in a continuous fashion up to single point) on the sphere. Hence any (disk) planar diagram 
can be drawn on a sphere without crossing lines, simply by drawing the arcs on the same side (see 
Figure |SJ). The key observation is that the sphere is naturally partitioned in several parts by the 
diagram. As explained in it is useful to draw the arcs with a "double-line notation" (see Figure 
03)- In this way it is clear that the sphere is partitioned into several polygons. Note that all the lines 
have an orientation induced by the one of the circle. 
The Euler characteristic x °f a diagram is defined as 

X = #V-#E + #F, (3.1) 

where #V, j^E and #F are the numbers of vertices, edges, and faces, respectively. A vertex is just 
a nucleotide, an edge is any line connecting two nucleotides (either an arc joining the nucleotides, or 
the RNA sequence) and a face is that part of the surface within a closed loop of edges. Obviously, 
if there are n arcs then #E = j^V + n. A famous theorem of Euler states that any polyhedron 
homeomorphic to a sphere with a boundary (puncture) has an Euler characteristic x = 1. Therefore 
all RNA secondary structures without pseudoknots are described by disk diagrams with x = l. 

Let us discuss the case when there is a pseudoknot. For simplicity, we consider a "kissing hairpin" 
pseudoknot. In this case the corresponding disk diagram is not planar, and has crossing arcs (see for 
instance Figure EJ). After drawing the disk diagram in double line notation, and counting the number 
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of vertices, edges and faces, one gets that \ = ~ 1 this time. This value has a precise geometrical 
meaning. In fact, the Euler characteristic of a surface (or of a manifold in general) is closely related 
to its genus g, i.e. the number of "handles" of the surface. Namely if the manifold is orientable (as 
the disk diagram is, since the oriented circle line defines naturally an orientation of all the elements 
of the diagram), then one has x = 2 — 2g — c where c is the number of punctures (c = 1 in the case we 
consider here, with only one RNA strand). It follows that a kissing hairpin is represented by a disk 
diagram with genus g = 1. One concludes then, that such a disk diagram can be drawn on an oriented 
manifold with one handle, that is a torus (which is a doughnut-shaped surface formed by taking a 
cylinder and joining the two circular ends together, see Figure Ej). This procedure can be extended 
easily to cases with more complex pseudoknots. For instance, the three diagrams of Figure |1] have 
genus g = 2, g = 1 and g = 2, respectively. In Figure |B1 there is a graphical representation of all 8 
types of irreducible pseudoknots with genus g = 1 (from [331) and in Figure 03 there is some examples 
of pseudoknots with a higher genus. 

Thus we have a simple way to classify pseudoknots. This classification corresponds exactly to the 
series expansion of the partition function of the matrix model proposed in There, the series is in 
powers of the form N~ 2g , where N is the size of the matrix, and g is the genus of the corresponding 
set of diagrams. In the next section, we will exploit the same idea and show how one can control the 
topological character of pseudoknots in a statistical mechanical model for RNA secondary structures 
with pseudoknots. 



4 Statistical mechanics model of RNA structures with pseudoknots 

In almost all the energy models for RNA which have been proposed in recent years, the thermo- 
dynamical properties of a single stranded RNA are studied by means of a partition function of the 
form 



zrna = /n^- E/(W) e "^ (ClJ,{r}) -E^) 

h — 1 C i i Ci -)' 



Y j e~ 1 ^ T[E{C) ~ TmC) \ (4.1) 



where T is the temperature, ks is the Boltzmann constant, is the three-dimensional position vector 
of the k-th nucleotide, /({r}) takes into account the geometry and the constraints of the chain of 
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Figure 7: The "kissing hairpins" of Figure El can be drawn on a torus without intersections. In this 
example #F = 9,#V = 20, #E = 20 + 10, and therefore x = -1, i-e. 9 = 1- 

nucleotides, the function U takes into account the energetics coming from the pairing and stacking 
of base pairs, and the sum over Cj,- is the sum over all possible contact matrices for a given primary 
structure. The function lo(C) is proportional to the number of configurations having the same contact 
matrix C, and therefore its logarithm is just the entropy factor associated to the polymeric nature of 
the sugar-phosphate backbone. The free energy of a given configuration J-{C) = E(C) — TS(T, C) is 
the sum of several contributions, both of energetic ( E{C) ) and entropic nature ( S(C) ): Watson- 
Crick and wobble base pairs, stacking energies, terminal mismatches and dangling energies, special 
triloops and tetraloops, entropy contributions (internal loops, bulges, hairpin loops), penalty factors 
for terminal- AU in helices, for asymmetries etc. All these terms have been determined empirically, and 
they are called "Turner energy rules" jS2j. For more details see [HH1- When pseudoknots are excluded, 
the sum in eq. (|4.1I) is restricted over contact matrices that correspond to planar diagrams only. 
As we mentioned already in the introduction, the partition function Zrna without pseudoknots can 
be calculated efficiently by deterministic algorithms (dynamic programming) the most popular 
ones are perhaps the "mfold package" by M.Zuker et al. jSHl EHI and the "RNA Vienna package" by 
I.Hofacker et al. §7 2 - When pseudoknots are included, the sum in eq. (|4.1|) is unrestricted and, as we 
described in the previous Section, this leads to topology fluctuations. This situation is very common 
also in other areas of Physics (e.g. dynamical triangulations |58j . random surfaces or quantum gravity 
|59j . quantum field theory |60 ), and there are now standard ways to deal with it. The idea is to 
introduce an additional parameter /j,, which is a topological "chemical potential" , and to consider the 
partition function: 



2 They are available on-line at www.bioinfo.rpi.edu/applications/mfold/ and www.tbi.univie.ac.at/, respectively. 




(4.2) 



c, 



9 



Figure 8: List of all eight irreducible diagrams with genus g = 1 (from [10]) and their representation 
with double line notation, on the left column and right column, respectively. 

where g(C) is the genus of the configuration associated to the contact matrix C. The "chemical 
potential" \i allows a simple control over the topological character of the pseudoknots in the statistical 
ensemble at thermal equilibrium. It is also directly related to N (the size of the matrix) in the matrix 
model formulation of |38| : 

^Matrix ~ 1 + + + • ■ • , 

with /i = — 2/csTlog(A r ). The advantage here is that the energy function E(C) can be more realistic 
than the one in |38| . 

The model without chemical potential, i.e. fi = 0, corresponds to the case where there are no 
restrictions on the possible fluctuations of the topology. On the other hand when /i is very large, all 
the configurations with g > are suppressed by the Boltzmann weight, and in this case one recovers 
the planar limit (i.e. RNA secondary structures without pseudoknots). One might expect then a 
phase transition associated to the formation of pseudoknots. A natural order parameter is the average 
genus of a RNA structure with pseudoknots which can simply be recovered by taking the logarithmic 
derivative of the partition function 

d 

(g{v)) = -k B T— log Z RNA (fi) . (4.3) 

To our knowledge there are no available experimental data about the dependence of the genus of RNA 
molecules on the temperature. Informations and inputs from experiments would be highly desirable. 

Figure displays the expected phase diagram in the plane {/i, T} of our model. At high tem- 
perature, the RNA is always in a fully denaturated phase. At lower temperature and large /z the 
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Figure 9: Example of RNA pseudoknots with higher genus. The first two plots correspond to the 
diagrams a) and c) of Figure E] with genus g = 2. The third plot is an example with genus g = 3. 

secondary structures without pseudoknots are the dominating configurations. The interesting part of 
the diagram is for lower values of //, where possibly (g(fM)) ^ and pseudoknots are present. 

Even if eq. (|4.2|) can in principle deal with pseudoknotted RNA molecules, it is fair to say that for 
any realistic energy function, the model is rather unlikely amenable to an analytic solution. Moreover 
any dynamic programming approach has been shown to be computationally very demanding even for 
pseudoknots with genus g = 1 jlUJ. Hence a stochastic algorithm for studying the model eq. (|4.2|) is 
probably the only feasible way. In the next Section we describe in details a Monte Carlo algorithm 
for the simulation of the model of eq. ()4.2|) . 

5 A Monte Carlo algorithm for RNA pseudoknots prediction 

A well-known method for generating a set of configurations which are distributed according to a given 
Boltzmann weight is the Monte Carlo method. It is a standard method of modern computational 
analysis and we refer to I62| for a review and an introduction on the subject. In recent years, it 
has been also used for the prediction of RNA secondary structures in various contexts. In particular, 
our proposal can be thought of as a generalization of the Monte Carlo method described in [2] where 
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denaturated phase 




planar limit 



Figure 10: Qualitative structure of the phase diagram in the {/i, T} plane. 



the authors considered only RNA secondary structures without pseudoknots. We aim to apply this 
method to the statistical ensemble defined by eq. (|4.2[) . 

The sum over all the RNA configurations in eq. (|4.2|) contains many terms. In general, the total 
number of RNA configurations (planar and non planar configurations) grows like L! for a sequence 
of length L. The number of RNA configurations with a fixed genus g grows exponentially with L: a 
detailed analysis for the number of diagrams with genus g = (i.e. planar diagrams) can be found 
in [EH]- 3 Since the number of secondary structures on a surface with fixed genus grows exponentially, 
one expects that a brute force Monte Carlo importance sampling would be rather ineffective for a not 
too-short RNA sequence, in a reasonable amount of computation time. For that reason we decided to 
use the standard Metropolis method |64| . The Metropolis method is an efficient and simple scheme for 
generating a set of configurations distributed according to a given probability function, by means of a 
random walk in the configuration space. In our case, the Metropolis Monte Carlo method generates a 
set of n RNA configurations {C^°\ . . . ,C^}, such that ]xsn n ^ oa n c /n = P(C), where P(C) is the 
given probability distribution (e.g., the Boltzmann distribution P(C) = Z~ x exp[(£' — TS + [ig) /ksT] 
and nc is the number of configurations of type C in the statistical ensemble. Each element of 
the sequence is generated by accepting or rejecting a random configuration. In the following we give 
a complete description of the Metropolis Monte Carlo algorithm for RNA pseudoknots predictions: 

• Step 1: Pick an initial configuration C^: A simple initial configuration can be the fully denatu- 
rated state of RNA, i.e. the contact matrix is the matrix with all zero entries and the respective 

permutation involution is the identity permutation a = ( j ^ " ' £,>' ^ e variable 



n = 1. 



Step 2: Pick a trial configuration (by deforming the configuration C^ n Such an opera- 
tion is called "Monte Carlo move" C^™" 1 ) — > C^ n \ Compute the probability ratio 

- P ( C( "'» (5.1) 



P(C("- 1 )) 



Pick a random number x with value between and 1. If x < p accept the configuration 

as the new configuration. Otherwise refuse it and keep C^ n ~ 1 ^ as new configuration, i.e. put 

C( n ) = (^C™- 1 ) _ Increase the variable n by one. 



3 An analysis for structure with higher genus similar to the one in |63| is still lacking. We expect that the matrix field 



theory model introduced in |38| can shed some light on this issue. 
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• Step 3: repeat Step 2 for n max times, where n max is a sufficiently large number. 

The most relevant aspect of this method is that, at large n max , one can generate an ensemble of 
configurations with the probability distribution P(C) = Z~ x exp[-(E(C)-TS(T, C)+fig(C))/{k B T)], 
simply by computing probability ratios. Therefore, this method is extremely useful as it avoids the 
need of computing the partition function Z of the system, a computational task that would be surely 
intractable for long RNA sequences. 



E\ 



e -AE/k b T 




c 

Figure 11: The Metropolis algorithm accepts a configuration with lower energy with probability P = 1. 
It can also accept a configuration with higher energy, with probability P = e ~ AE / K b T ^ where AE is 
the energy difference. 



5.1 Configurational changes (Monte Carlo moves) 

At lar ge n max , the above algorithm is guaranteed to generate a set of configurations with the probabil- 
ity distribution P(C), under few assumptions. Two essential requirements are that the Monte Carlo 
moves have to be ergodic and satisfy the so-called detailed balance condition. Ergodicity essentially 
means that every point in the configuration space can be reached in a finite number of Monte Carlo 
steps from any other point. The detailed balance condition here simply means that the Monte Carlo 
moves are symmetric, i.e. the probability of proposing a Monte Carlo move C — ► C is the same as of 
proposing the move C — > C. 

We describe now the Monte Carlo moves for RNA folding. First, at the beginning of the simulation, 
it is useful to make some book-keeping by storing in the memory the list of all the allowed base-pairs 
(i.e. that are only of the type A»U, C»G or G»U). Such an information can be stored in L vectors 
li, i = 1, . . . , L, as follows: the nucleotide in position i can be paired to rij possible other nucleotides, 
namely with the ones in position /j(l),^(2), . . . , Zj(nj) and nothing else. For example, if the primary 
structure is {AGCU} then we have: 

h = [A], i 2 = [3,4], l 3 = [2], h = [l,2]. (5.2) 

The creation of such a list of possible base-pairs does not slow down the total algorithm since it is 
an 0(L 2 ) operation which is done only once. Now we want to extract one element from the list of 
L vectors with uniform probability. This can be done as follows. Let T = ^2 h nh, and let pick up a 
uniform integer random number r between 1 < r < T. Then take the highest integer number i such 
that J2h=i n h < r > an d define y = r — Ylh=i n h + 1- Obviously 1 < y < T holds true. Consider the pair 
of bases i and j = h{y). The base-pair i — j has been extracted randomly with uniform probability 
in the set of all possible base-pairs, for the given RNA sequence. The Monte Carlo move C —> C is 
then generated as follows: 
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• If the configuration C is such that both the base in i and in j are free (i.e. cxc(i) = i and 
a c(j) = j) then add the link i — j (i.e. put ac{i) = j and uc'{j) = £)• We cau this Monte Carlo 
move "add a base pair" (see case 1 of figure IT^]). 

• If the configuration C is such that there is arc between i and j (i.e. crcif) = j) then remove the 
link i — j (i.e. put ac(i) = i and ac'(j) = j)- We call this Monte Carlo move "remove a base 
pair" (see case 2 of figure ^J). 

• If the configuration C is such that either the base in i or the base in j is linked to some other 
base, (i.e. ac(i) = i and ac{j) ^ j, or ac(j) = 3 an d &c(i) 7^ i) then move the link back to 
* — Ji by overriding any former link (i.e. put ac(i) = j and <Jc{j) = i)- We call this Monte 
Carlo move "shift a base pair" (see case 3 and 4 of figure ^J). 

• If the configuration C is such that the base i is linked to an other base k\ and j is linked to 
an other base &2 and the base-pair k\ — &)2 is possible, then swap the links (i.e. put ac(i) = j, 
a C'U) = h a C'(ki) = k2, o"c(^2) = ki). We call this Monte Carlo move "swap a base pair" (see 
case 5 and 6 of figure IT2|) . 

• If none of the above cases applies then do not update the configuration, i.e. put C = C. 
See Figure^] for a summary of these Monte Carlo moves, and Figure [TBI for a simple example. 



C C 




Figure 12: Monte Carlo moves for an allowed base pair i — j of a RNA secondary structures with 
pseudoknots. The move 1) adds a base pair. The move 2) removes a base pair. The moves 3) and 4) 
shift a base pair. The move 5) and 6) swap two base-pairs, when possible. 

These Monte Carlo moves obviously satisfy the detailed balance condition. In fact the probability 
of creating a link between i and j when i or j are link- free (or at least one of them), or of removing the 
link when they are already linked is always = 2/T, and thus it is symmetric. In the case where i 



14 



and j are already linked to different bases a(i) and cr(j), then a link is put between i and j only if a{i) 
can be connected to a{j) as well. In this case the reverse move also occurs with the same probability 
rate, thus it is a symmetric move. Moreover, the set of Monte Carlo moves are ergodic. The key 
observation is that such moves correspond to transpositions in the space of permutation involutions. 
Since any configuration of RNA secondary structure with pseudoknots can be uniquely represented by 
a permutation involution, and since any permutation can be obtained by a suitable finite sequence of 
transpositions |65| . it follows that any RNA secondary structure with pseudoknots can be generated 
with a finite number of the Monte Carlo moves described above. 
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Figure 13: Space of the configurations for the sequence {^4, U, A, U}. The arrows indicate the Monte 
Carlo moves and their probability rate. 



Few comments are in order. First of all, other sets of Monte Carlo moves are possible of course. 
Several authors introduced collective moves, where several links are updated at the same time (as 
opposed to one by one as we propose). The advantages are a general speed-up of the computing time, 
and a more effective simulation as far as overcoming the energy barriers. In the present work, we 
prefer to keep the code as simple as possible by using a set of "local" moves, and to focus on testing 
its effectiveness when dealing with RNA pseudoknots. Second, both the generation of the Monte Carlo 
moves and the Metropolis method require a good (pseudo)random number generator, in order to avoid 
biases in the output which may be very difficult to detect. For a good introduction to random number 
generators we refer the reader to ;66! and |67| . Finally, as in all stochastic algorithms, one has to be able 
to estimate the statistical errors of the Monte Carlo prediction. As this method generates an ensemble 
of configurations {C^°\ . . . , C^ nmax ^}, distributed according to the probability distribution P(C), 
then one can compute ensemble averages of any quantity A(C) simply by: 

i<"max -. 

i=l 

The error associated to this observable scales like 1/yN where N is the number of independent 
measurements. It is important to note that N is not usually equal to n max since in general the 
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configurations generated by any Monte Carlo algorithm are correlated. One can deal with this issue in 
two ways. One can compute the autocorrelation length £ of the sequence of the RNA configurations 
generated by the Monte Carlo algorithm and then subsample the same set of configurations, keeping 
one configuration every £ and skipping all the configurations in between |68| . An other possibility is 
to keep all the configurations of the sequence, and compute the statistical error by taking into account 
the existence of correlations (The error is usually bigger than the simple standard deviation of the 
data). A well-known technique for computing the statistical error of a set of correlated data is the 
so-called jackknife method. For an introduction to this method, we refer the reader to (OH]- There are 
also other techniques which can be found in |7U| , 

5.2 Energy update 

According to the Metropolis method, the ratio p (Step 2 of the algorithm, eq. (|5.1|) ) is given by 

1 



p = exp 

where 



k B T 



{AE-TAS + pAg) 



(5.4) 



AE = E{C {n) ) -E(C (n - iy ), 
AS = S(C^) - S(C( n ~V) , 
Ag = g (C^)-g(C^). 

Since the Monte Carlo moves are local (i.e. they involve only a small part of the RNA sequence) the 
computation of AE, AS and Ag is usually easier and faster than computing the full functions E{C), 
5(C) and g(C). 

We consider first Ag, and we provide an efficient algorithm for computing it. According to eq. 
(jSHJl, the genus is given by g = (1 - #F + #E - #F)/2 = (1 - L + (L + n arcs ) - n loops )/2. Therefore: 

« 1 ~)~ Afi arcs — Ani 00 „ s 

Ag = , (5.5) 

where 

{—1 for a "remove the base-pair i — j" move, 

1 for a "add the base pair i — j" move, (5-6) 
for a "shift or swap the base pair i — j" move. 

The difference Arii oops can be computed by considering the loops containing the bases i and j. In 
principle there are 4 possible independent loops, two about i and two about j (see Figure ITi|) . The 
connectivity of the RNA molecule can be such that the four loops are not independent. In Appendix lAl 
we describe an algorithm for computing the actual number of independent loops. Then it is sufficient 
to run the algorithm over and C^™ -1 ) and compute the difference of loops, that is Ani oops . 

Secondly, the calculation of AE is more problematic. There is not yet an RNA energy model for 
any given topology. The most studied and well-defined model both from a theoretical and experimental 
point of view is for the spherical topology (i.e. genus= 0, that is RNA secondary structure without 
pseudoknots 1 [7T] . The set of empirical "Turner" energy rules can be generalized in order to describe 
some simple class of pseudoknots 02] but the general case for any topology is still lacking. For AS 
the situation is slightly better, as one can model the configurational entropy of the RNA structure by 
using the theory of polymers (as already presented in |16j or |42j ) and the inclusion of pseudoknots is 
in principle feasible. Therefore, in our numerical simulation we will use the "Turner" energy model, 
and even if this is not quite appropriate for higher topology, we expect the corrections to be small 
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J L J L 

< j 




Figure 14: Two given bases i and j usually belong to 4 loops, when drawing the arc with the double- 
line notation. The loops are not always independent. Here there are two examples: a case with 2 
independent loops (top), and a case with only one independent loop (bottom). 

with respect to the over-all energy scale. We remind the reader that the purpose of this paper is to 
propose a new approach for the study of pseudoknots formation in RNA secondary structure. Thus at 
this point, it is reasonable to perform a preliminary analysis based on an approximate energy model. 
When a more complete energy model (including all the topologies) will be available, it will be sufficient 
to include it in the calculation of AE in our algorithm. 

5.3 The "simulated annealing" method 

One of the major problems about the Monte Carlo simulation of RNA folding, is that the energy 
landscape is usually very rough with metastable valleys separated by energy barriers which are "high" 
compared to the energy involved in each Monte Carlo move. This is a general situation in thermody- 
namic systems with many degrees of freedom (e.g. glasses, polymers, proteins etc.). where in addition 
to the global minimum energy configuration there may be many local minima separated by high en- 
ergy barriers. The worst consequence is that the system can be trapped for a long time in such local 
minima and the Monte Carlo exploration of the energy landscape is no longer effective. In order to 
over come this problem with RNA M.Schmitz and G. Steger in [2] proposed the use of a computational 
technique called "simulated annealing" method. It is a classical method which has been introduce for 
finding the minimum energy configuration of a system with a very rough energy landscape [22]. We 
briefly describe the algorithm: 

• Step 1: generalize the partition function eq. (|4.2[) to the form: 

Zrna = £ e -^^)-™(^) +w (c)] (5 ?) 

Cij 

and initialize O = & m ax > T. 

• Step 2: Starting from an initial configuration (e.g., the fully denaturated RNA configura- 
tion) sample n configurations by means of the Metropolis Monte Carlo method applied to eq. 

• Step 3: Go to Step 2, and replace 6 by a lower value, and by C l( - n \ Repeat this step until 
the temperature of the system is equal to T. During the Monte Carlo process keep track all the 
time of the configuration with the lowest energy. 

One can show that usually, the global minimum can be obtained by using a logarithmic rate |73| . In 
practice, other annealing schedules are possible: linear, hyperbolic, exponential, power-law schedules 
are often implemented. 
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Assuming that at low temperature an RNA molecule assumes a configuration which corresponds 
to the minimum energy, we can find such a configuration by using the simulated annealing method, 
starting the simulation with a value of G well above the melting temperature (say a few hundred 
degrees Celsius) . A first check of this method is whether we can reproduce the results produced by 
deterministic algorithms such as "mfold" |55| or the "Vienna Package" [S7j. For that purpose, it is 
sufficient to use the "Turner" energy model and run our algorithm with a large value of the chemical 
potential /i. Our preliminary tests showed that the minimum can be easily found for sequences with 
length up to around 300 bases. For longer RNA sequences, the simulation time increases and the 
minimum may be harder to find. In this cases we can use an additional feature of our model. In fact 
our approach offers also the interesting possibility of using the chemical potential for overcoming the 
energy barriers. It means that we can apply a "simulated annealing" method on [i rather than on T. 
Thus, starting with a low value of \i (where all the topologies with any genus are allowed) the Monte 
Carlo simulation can quickly explore regions which are very distant from each other in the energy 
landscape. Then by slowly increasing the value of \i we gradually constrain the simulation to select 
only planar configurations (i.e. secondary structures without pseudoknots), and the minimum energy 
configuration eventually. During this process, that is for intermediate values of /i, many configurations 
in thermal equilibrium are generated, and in general they correspond to diagrams with (g) ^ 0, i.e. 
RNA configurations with pseudoknots. These configurations are the prediction of our algorithm and 
they should be compared with the experimental data. It is at this level that the value of [i can be 
tuned, in order to fit the data. The results of our preliminary investigations in this region of the 
phase diagram are very encouraging and promising. However, in this paper we limit ourselves to the 
description of this new method and of our algorithm. The results of our simulation will be published 
shortly. 

6 Conclusions 

In this paper we propose a new approach to the problem of RNA folding with pseudoknots. We start 
from a classifications of RNA pseudoknots based on their graphical representation by means of disk 
diagrams. A generic disk diagram is usually not planar, i.e. cannot be drawn on a plane surface without 
crossing lines. However, if the surface has a high enough genus (i.e. a sufficient number of "handles"), 
the diagram can always be drawn on that surface without any crossing. The precise correspondence 
is obtained by using a famous theorem by Euler, and it precisely corresponds to the topological 
classifications of RNA pseudoknots already introduced in jHH]. Then we propose a statistical mechanics 
model where the formation of RNA pseudoknots is associated with fluctuations of the topology (eq. 
(j4.2j) ). In order to do that we introduce a parameter, the topological "chemical potential", which 
controls the rate of pseudoknots formation, and can be obtained by fitting experimental data. We then 
discuss the qualitative structure of the phase diagram for the RNA molecule in the plane {fi, T} and its 
interpretation. Finally we describe a Monte Carlo algorithm for the prediction of RNA pseudoknots. 
It is based on a standard Metropolis algorithm coupled to the "simulated annealing" method and 
we provide an explicit description of its implementation and use. A numerical investigation of this 
technique and the phase diagram is under way and will be published shortly. 
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try: theory and applications" at Les Houches (France) on March 2004, where this work has been first 
presented. GV acknowledges the support of the European Fellowship MEIF-CT-2003-501547. 
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A An algorithm for computing ni oops 



In this Appendix we describe an algorithm for computing the number of independent loops adjacent 
to the i-th and j-th nucleotides. It is useful for computing the variation of the genus when one of the 
Monte Carlo moves we described in Section [5J The algorithm we propose for counting the number of 
independent loops is based on tracking a walk along the diagram starting from the base i and marking 
the loop with an identifying number (or color). Namely, we represent the configuration C by means 
of the permutation involution ac (as described in Section EJ), and the algorithm is: 

START 



V {. 1) —VK.Z) — V K.O) — V K.'i) — U 




% set the four flags to zero 


pos=i 




% the start position is i — th base 


COlor = l 




% using color 1 


v(l)=color 




% mark the first flag with the color in use 


do{ 




% start the first loop 


pos=sigma(pos) 




% follow the permutation involution 


if pos==i then v(2)= 


color 


% check if it is either in i or j 


if pos==j then v(4)= 


color 




pos=shif t (pos) 




% shift move (along the RNA circle) 


if pos==j then v(3)= 


color 


% check if it is in i again 


} while (position ! =i) 




% repeat until it returns at the starting point 


if v(2)==0 then{ 




% check if the second loop has been marked already 


color=color+l 




% if yes , change color 


pos=i 




°/ start again from i-th base 


v(2)=color 




% mark the second flag 


do{ 




% repeat all the above for the second loop (at i) 



pos=shif t (pos) 

if pos==j then v(3)=color 

pos=sigma(pos) 

if pos==j then v(4)=color 

} while (pos!=i) 



if v(3)==0 then{ % repeat all the above for the third loop (at j) 

color=color+l 
pos=j 

v(3)=color 
do{ 

pos=sigma(pos) 

if pos==j then v(4)=color 

pos=shif t (pos) 

} while (pos!=j) 



if v(4)==0 then{ % the fourth loop at j is independent from the 

color=color+l % previous ones if and only if it has not been 

} % marked yet 
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nloops=color % the number of independent loops is the number 

% of used colors 

END 

L is the length of the RNA sequence and the function shift(pos) is just the increment by 1 of the 
variable pos with period L, i.e. shift(pos)=remainder(pos,L)+l. At the end of the algorithm the 
variable "color" contains the number of independent loops ni oops . The algorithm runs in a time 
proportional to 0(L). 
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